About the project


Hi, I am Wenhsuan!

I hope to pick up some useful techniques in R during this course, and to be able to analyze data on my own after the semester ends.

*GitHub repository


Regression and model validation


This is a data set with 60 variables. By analyzing it, we hope to understand which variables are related to exam points.

Step 1: Data cleaning. To analyze the data, the first step is to clean it: scale the “Attitude” column and select the information we are interested in. Since there are too many variables (183 observations and 60 variables), which would make the analysis hard, I combine related questions into three broad categories: deep, surface and strategic. I then average the values of the deep, surface and strategic question columns. Finally, I keep only the rows where points is greater than zero. These are the steps of the data cleaning.
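In code, the cleaning step might look like the following sketch. The question column names here (D1, SU1, ST1, …) and the toy values are made up for illustration; the real questionnaire has many more question columns per category.

```r
library(dplyr)

# Toy stand-in for the raw survey data (column names and values are hypothetical):
raw <- data.frame(
  Attitude = c(30, 25, 40, 20),
  D1 = c(4, 3, 5, 2), D2 = c(5, 3, 4, 1),    # "deep" questions
  SU1 = c(2, 4, 1, 3), SU2 = c(3, 4, 2, 3),  # "surface" questions
  ST1 = c(4, 2, 5, 3), ST2 = c(3, 2, 4, 2),  # "strategic" questions
  Points = c(22, 0, 30, 19)
)

deep_columns      <- c("D1", "D2")
surface_columns   <- c("SU1", "SU2")
strategic_columns <- c("ST1", "ST2")

students <- raw %>%
  mutate(
    attitude = Attitude / 10,              # scale the attitude sum score
    deep = rowMeans(raw[, deep_columns]),  # average each question category
    surf = rowMeans(raw[, surface_columns]),
    stra = rowMeans(raw[, strategic_columns])
  ) %>%
  select(attitude, deep, surf, stra, points = Points) %>%
  filter(points > 0)                       # drop rows with zero exam points
students
```

The same pattern (average each category with `rowMeans`, then `filter(points > 0)`) applies to the full 60-variable data.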

Step 2: Show a graphical overview of the data and show summaries of the variables in the data.


##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

After the data cleaning step, the data has 166 observations and 7 variables, and I start drawing some plots.

The plots show the distribution of each variable and the relationships between pairs of variables, split by gender. There is a positive correlation between attitude and points (0.43), while deep and surf are negatively correlated. From the box plots, the values of age and the deep questions are relatively concentrated; however, age has many outliers.

The summary above shows the minimum, maximum, mean and quartile values of each variable (age, attitude, deep, and so on).
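An overview like this can be produced with GGally's `ggpairs` together with `summary()`. A minimal sketch, using the built-in iris data as a stand-in since the cleaned course data is not loaded in this chunk (`Species` plays the role of gender):

```r
library(GGally)
library(ggplot2)

# Pairwise plot matrix: distributions on the diagonal, scatterplots and
# correlations off the diagonal, colored by a grouping factor.
p <- ggpairs(iris, columns = 1:4,
             mapping = aes(col = Species, alpha = 0.3))
p

# Numerical summaries (min, quartiles, mean, max) of every variable:
summary(iris[, 1:4])
```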

Step 3: Choose the attitude, deep and strategic question variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable.

## 
## Call:
## lm(formula = points ~ attitude + deep + stra, data = students2014)
## 
## Coefficients:
## (Intercept)     attitude         deep         stra  
##     11.3915       3.5254      -0.7492       0.9621
## 
## Call:
## lm(formula = points ~ attitude + deep + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## deep         -0.7492     0.7507  -0.998  0.31974    
## stra          0.9621     0.5367   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

I choose points as the dependent variable (Y) and attitude, deep and strategic questions as explanatory variables (X) in a multiple regression. According to the summary, the F-test p-value is 2.521e-08, which is smaller than 0.05, so the model as a whole is statistically significant. Attitude is the only clearly significant predictor (p = 4.44e-09); deep and stra are not significant at the 0.05 level. The residual standard error (5.289) is fairly large, so individual predictions are not very precise. The multiple R-squared and adjusted R-squared are low (0.2097 and 0.195), so the model explains only about a fifth of the variation in exam points; still, R-squared is not the only criterion for judging a regression model.

Step 4: Produce the Residuals vs Fitted plot, Normal Q-Q plot and Residuals vs Leverage plot.

1. Residuals vs Fitted plot: a “good” residuals vs fitted plot should have no obvious outliers and be roughly symmetrically distributed around the 0 line, without particularly large residuals. In the plot the residuals show no clear pattern against the fitted values, so the model's assumptions of linearity and constant error variance look reasonable.

2. Normal Q-Q plot: if both sets of quantiles come from the same distribution, the points should form a roughly straight line along the 45-degree reference line. In the plot the points fall approximately on this line, which suggests the residuals are approximately normally distributed.

3. Residuals vs Leverage plot: this plot helps identify data points that have a large influence on the model. Points outside the red dashed Cook's distance lines are influential; removing them would likely noticeably alter the regression results.
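In R, all three diagnostics come straight from `plot()` on the fitted `lm` object via the `which` argument. A sketch using the built-in mtcars data as a stand-in for the students model:

```r
# Fit a stand-in linear model (mtcars instead of the course data):
model <- lm(mpg ~ wt + hp, data = mtcars)

# which = 1: Residuals vs Fitted, which = 2: Normal Q-Q,
# which = 5: Residuals vs Leverage
par(mfrow = c(1, 3))
plot(model, which = c(1, 2, 5))
```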


Logistic regression

## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

This data set concerns students’ alcohol consumption. There are 382 observations and 35 variables, including the student’s sex, age, family size, alcohol consumption, the parents’ education and jobs, and so on. Through the analysis, I want to study the relationship between high/low alcohol consumption and some of the other variables in the data.

I assume that “studytime” (weekly study time), “failures” (number of past class failures), “goout” (frequency of going out with friends) and “freetime” (free time after school) are important variables with a strong relationship to alcohol consumption. My hypothesis is that students who study less per week, fail more classes, go out with friends more often and have more free time after school consume more alcohol.

1. Numerically and graphically explore the distributions.

## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_study_time
##   <fct> <lgl>    <int>           <dbl>
## 1 F     FALSE      157            2.34
## 2 F     TRUE        41            2   
## 3 M     FALSE      113            1.88
## 4 M     TRUE        71            1.62
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_failures
##   <fct> <lgl>    <int>         <dbl>
## 1 F     FALSE      157         0.204
## 2 F     TRUE        41         0.439
## 3 M     FALSE      113         0.239
## 4 M     TRUE        71         0.479

According to the summary statistics of study time grouped by sex and high_use, females with high alcohol use (more than 2 times per week) study less on average (2.00) than females with low use (2.34). The same pattern holds for males (1.62 vs 1.88). This corresponds to my hypothesis.

According to the summary statistics of failures grouped by sex and high_use, females who consume more alcohol have failed more classes on average (0.439 vs 0.204), and the same holds for males (0.479 vs 0.239). This also corresponds to my hypothesis.

The boxplots show that for the “goout” variable, females who consume more alcohol go out more, and the same holds for males, which corresponds to my hypothesis. For “freetime”, females who consume more alcohol have more free time after school, but the difference is much less clear for males, so this result only partly matches my hypothesis.
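The group summaries and boxplots above can be produced with dplyr and ggplot2. A minimal sketch on a made-up toy data frame with the same structure as the alc data (the real data is not loaded here, so the values are invented):

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the alc data (hypothetical values):
alc_toy <- data.frame(
  sex = factor(c("F", "F", "F", "M", "M", "M")),
  high_use = c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE),
  studytime = c(3, 2, 3, 2, 1, 2),
  goout = c(2, 4, 2, 3, 5, 4)
)

# Summary table: mean study time by sex and consumption group
study_summary <- alc_toy %>%
  group_by(sex, high_use) %>%
  summarise(count = n(), mean_study_time = mean(studytime), .groups = "drop")
study_summary

# Boxplots of goout by high_use, split by sex
ggplot(alc_toy, aes(x = high_use, y = goout, col = sex)) +
  geom_boxplot() +
  ylab("going out with friends")
```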

2. Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable.

I choose “studytime”, “failures”, “goout” and “freetime” as the four explanatory (X) variables and fit a logistic regression model with “high_use” (high/low alcohol consumption) as the target (Y). I also looked at the data separately for males and females to dig deeper into it. Below is what I found from the model.

## 
## Call:
## glm(formula = high_use ~ studytime + failures + goout + freetime, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8214  -0.7528  -0.5442   0.8552   2.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.36957    0.62399  -3.797 0.000146 ***
## studytime   -0.57481    0.16784  -3.425 0.000615 ***
## failures     0.19303    0.16899   1.142 0.253334    
## goout        0.70490    0.12039   5.855 4.77e-09 ***
## freetime     0.07209    0.13531   0.533 0.594163    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 395.17  on 377  degrees of freedom
## AIC: 405.17
## 
## Number of Fisher Scoring iterations: 4
## (Intercept)   studytime    failures       goout    freetime 
## -2.36956938 -0.57481413  0.19303395  0.70489610  0.07209276
##                     OR     2.5 %    97.5 %
## (Intercept) 0.09352099 0.0267779 0.3109179
## studytime   0.56280947 0.4007339 0.7752293
## failures    1.21292398 0.8699068 1.6929038
## goout       2.02363642 1.6081702 2.5811264
## freetime    1.07475503 0.8240334 1.4026604

The summary shows that the standard errors for “studytime”, “failures”, “goout” and “freetime” are 0.168, 0.169, 0.120 and 0.135, which are comparatively small, so the coefficients are estimated fairly precisely. The estimated coefficients are -0.575, 0.193, 0.705 and 0.072. Of these, “studytime” (negative) and “goout” (positive) are statistically significant (p < 0.001), while “failures” and “freetime” are not significant at the 0.05 level.

The odds ratio of “studytime” is 0.563 (less than 1), while those of “failures” (1.213), “goout” (2.024) and “freetime” (1.075) are higher than 1. This means that “failures”, “goout” and “freetime” are positively associated with “high_use”, while “studytime” is negatively associated with it. According to the confidence intervals, the intervals for “failures” (0.87 to 1.69) and “freetime” (0.82 to 1.40) contain 1, so their associations are not statistically significant, whereas the intervals for “studytime” (0.40 to 0.78) and “goout” (1.61 to 2.58) do not contain 1.

From these results, “goout” and “studytime” are the variables most clearly related to high/low alcohol consumption: going out more raises the odds of high consumption, while studying more lowers them. The evidence for “failures” and “freetime” is much weaker.
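The odds ratios and their confidence intervals in the output above come from exponentiating the model coefficients and their profile-likelihood confidence intervals. A sketch on a built-in stand-in model (mtcars, predicting transmission type), since the alc data is not loaded here:

```r
# Stand-in logistic regression on built-in data:
m <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

OR <- exp(coef(m))     # odds ratios
CI <- exp(confint(m))  # profile-likelihood CIs, on the odds-ratio scale
cbind(OR, CI)          # an OR whose CI excludes 1 is statistically significant
```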

3. Use the variables which have a statistical relationship with high/low alcohol consumption to explore the predictive power of the model.

##         prediction
## high_use FALSE TRUE
##    FALSE   248   22
##    TRUE     76   36

For the prediction step I drop “studytime” and refit the model with the remaining variables. From the confusion matrix, treating low consumption (FALSE) as the positive class, the precision is 248/(248+76) = 0.77 and the recall is 248/(248+22) = 0.92. The model therefore has high recall but low precision: most low-consumption students are correctly recognized, but there are many false positives. The proportion of wrong predictions is 0.2565 on the training data and 0.2487 under cross-validation, so the model’s error rate is quite high (around 25%).
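The confusion matrix, training error and cross-validation error can be computed as in the sketch below, again using mtcars as a stand-in data set (the alc data is not loaded here):

```r
library(boot)

# Stand-in logistic regression on built-in data:
m <- glm(am ~ wt + hp, data = mtcars, family = "binomial")
prob <- predict(m, type = "response")  # predicted probabilities
pred <- prob > 0.5                     # classify with a 0.5 cutoff

# Confusion matrix: observed class vs prediction
tab <- table(observed = mtcars$am, prediction = pred)
tab

# Proportion of wrong predictions on the training data:
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
train_err <- loss_func(class = mtcars$am, prob = prob)
train_err

# 10-fold cross-validation error with the same loss function:
cv <- cv.glm(data = mtcars, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```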


Dimensionality Reduction Techniques

1. Show a graphical overview of the data and show summaries of the variables in the data.

##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

The human data includes 155 observations and 8 variables: “Edu2.FM”, “Labo.FM”, “Edu.Exp”, “Life.Exp”, “GNI”, “Mat.Mor”, “Ado.Birth” and “Parli.F”.

The ggpairs plot shows the correlations between pairs of variables. “Ado.Birth” and “Edu.Exp”, “Ado.Birth” and “Life.Exp”, “Mat.Mor” and “Edu.Exp”, and “Mat.Mor” and “Life.Exp” have strong negative correlations; “Life.Exp” and “Edu.Exp”, and “Ado.Birth” and “Mat.Mor”, have strong positive correlations.

The corrplot gives a more compact visualization than ggpairs: it shows the correlation between each pair of variables with color, red for negative correlations and blue for positive ones. However, corrplot only shows the general strength of the relationship, not the exact correlation values.
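A corrplot like this is drawn from the correlation matrix. A sketch using the built-in swiss data as a stand-in for the human data:

```r
library(corrplot)

# Rounded correlation matrix of all numeric variables:
cor_matrix <- round(cor(swiss), digits = 2)
cor_matrix

# Upper triangle only; circle size and color encode the correlation
# (blue = positive, red = negative):
corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.7)
```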

2. Perform principal component analysis (PCA) on the non-standardized human data.

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped


## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

Since the variables are not standardized, the variables measured on the largest scale (especially GNI) dominate the analysis: PC1 has a huge standard deviation and captures essentially all (99.99%) of the variance. In the biplot, most countries cluster together, and the variable arrows do the same.

3. Standardize the variables in the human data and repeat the above analysis. Are the results different? Why or why not?

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000

The results before and after standardization are different. Before standardization, the countries and the variable arrows all bunch together, which makes the biplot hard to interpret; after standardization, the countries are distributed more evenly and the variables have more similar standard deviations (the arrow lengths are almost the same).

4. Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.

After the human data is standardized, the countries are distributed more evenly. The arrows show the connections between the original features and the first two principal components (PC1, PC2), and the countries are placed at the coordinates defined by these two PCs. The angle between arrows represents the correlation between the features: a small angle means a high positive correlation. Except for “Parli.F” and “Labo.FM”, which point along PC2, the variables are most strongly associated with PC1.

The lengths of the arrows are proportional to the standard deviations of the features; the plot shows that the standardized variables have similar standard deviations.
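The standardize-then-PCA workflow can be sketched as follows, again on the built-in swiss data as a stand-in for the human data:

```r
# Center and scale every column, then run PCA on the standardized data:
human_std <- scale(swiss)
pca <- prcomp(human_std)
s <- summary(pca)

# Percentage of variance captured by each PC, for the axis labels:
pca_pr <- round(100 * s$importance["Proportion of Variance", ], digits = 1)

# Biplot of the first two PCs with variance shares on the axes:
biplot(pca, choices = 1:2, cex = c(0.6, 0.8),
       xlab = paste0("PC1 (", pca_pr[1], "%)"),
       ylab = paste0("PC2 (", pca_pr[2], "%)"))
```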

5. Look at the structure and the dimensions of the tea data and visualize it. Interpret the results of the MCA and draw at least the variable biplot of the analysis.

The tea dataset includes 300 observations and 6 variables:

“Tea”: factor with 3 levels “black”, “Earl Grey”, “green”
“How”: factor with 4 levels “alone”, “lemon”, “milk”, “other”
“how”: factor with 3 levels “tea bag”, “tea bag+unpackaged”, “unpackaged”
“sugar”: factor with 2 levels “No.sugar”, “sugar”
“where”: factor with 3 levels “chain store”, “chain store+tea shop”, “tea shop”
“lunch”: factor with 2 levels “lunch”, “Not.lunch”

The summary shows the counts for each level of each variable:

Tea: black 74, Earl Grey 193, green 33
How: alone 195, lemon 33, milk 63, other 9
how: tea bag 170, tea bag+unpackaged 94, unpackaged 36
sugar: No.sugar 155, sugar 145
where: chain store 192, chain store+tea shop 78, tea shop 30
lunch: lunch 44, Not.lunch 256

In the Tea variable, most observations are “Earl Grey” (193); in How, “alone” (195); in how, “tea bag” (170); in sugar, “No.sugar” (155); in where, “chain store” (192); and in lunch, “Not.lunch” (256).
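The per-variable bar plots can be drawn by reshaping the data to long format and faceting, roughly as in this sketch (FactoMineR’s built-in tea data contains the same six columns used here):

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(FactoMineR)

# Keep the six columns of interest from the built-in tea data:
data(tea)
tea_time <- dplyr::select(tea, Tea, How, how, sugar, where, lunch)

# Reshape to long format (gather warns about dropped attributes; harmless):
tea_long <- gather(tea_time)

# One bar chart per variable:
ggplot(tea_long, aes(value)) +
  geom_bar() +
  facet_wrap("key", scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```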


## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

The bar plots visualize the summary, which makes the data easier for me to interpret. Next, I perform multiple correspondence analysis (MCA). The summary shows the eigenvalues, the individuals, the categories and the categorical variables. From the eigenvalues, Dim.1 and Dim.2 retain a larger percentage of the variance (15.2% and 14.2%) than the other dimensions. From the v.test values in the categories table, the coordinates of “black”, “Earl Grey”, “green”, “lemon”, “milk”, “tea bag”, “tea bag+unpackaged” and “unpackaged” are significantly different from zero (|v.test| > 1.96). From the categorical variables (eta2) table, “how” and “where” have the strongest links with Dim.1.

The MCA biplot shows possible variable patterns. The distance between variable categories measures their similarity: for example, “lemon” and “alone” are more similar than “lemon” and “other”.
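The MCA and its variable biplot can be reproduced with FactoMineR roughly as follows, using its built-in tea data and the same six columns:

```r
library(FactoMineR)

# Six-variable subset of the built-in tea data:
data(tea)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]

# Multiple correspondence analysis (graph = FALSE suppresses default plots):
mca <- MCA(tea_time, graph = FALSE)
summary(mca)  # eigenvalues, individuals, categories, eta2 table

# Variable biplot: hide individuals, color categories by variable
plot(mca, invisible = c("ind"), habillage = "quali")
```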